Abstract

We present a system submitted to the CoNLL-2004 shared
task for semantic role labeling. The system is composed
of a set of classifiers and an inference procedure used
both to clean the classification results and to ensure
structural integrity of the final role labeling.
Linguistic information is used to generate features
during classification and constraints for the inference
process.

1  Introduction 

Semantic role labeling is the complex task of discovering
patterns within sentences that correspond to semantic
meaning. We believe it is hopeless to expect high levels of
performance from either purely manual classifiers or purely
learned classifiers. Rather, supplemental linguistic
information must be used to support and correct a learning
system. The system we present here is composed of two
phases.

First, a set of phrase candidates is produced using two
learned classifiers: one to discover beginning positions
and one to discover end positions for each argument type.
Ideally, this phase discovers a small superset of all
phrases in the sentence (for each verb).

In the second phase, the final prediction is made. First,
candidate phrases from the first phase are re-scored using
a classifier designed to determine argument type, given
a candidate phrase. Because phrases are considered as a
whole, global properties of the candidates can be used to
discover how likely it is that a phrase is of a given
argument type. However, the set of possible role-labelings
is restricted by structural and linguistic constraints. We
encode these constraints using linear functions and use
integer programming to ensure the final prediction is
consistent (see Section 4).


2  SNoW  Learning  Architecture 

The learning algorithm used is a variation of the Winnow
update rule incorporated in SNoW (Roth, 1998; Roth and
Yih, 2002), a multi-class classifier that is specifically
tailored for large scale learning tasks. SNoW learns a sparse
network of linear functions, in which the targets (phrase
border predictions or argument type predictions, in this
case) are represented as linear functions over a common
feature space. It incorporates several improvements over
the basic Winnow update rule. In particular, a
regularization term is added, which has the effect of trying
to separate the data with a thick separator (Grove and Roth,
2001; Hang et al., 2002). In the work presented here we
use this regularization with a fixed parameter.

Experimental evidence has shown that SNoW activations
are monotonic with the confidence in the prediction.
Therefore, they can provide a good source of probability
estimation. We use softmax (Bishop, 1995) over the raw
activation values as conditional probabilities. Specifically,
suppose the number of classes is $n$, and the raw activation
value of class $i$ is $act_i$. The posterior estimation for
class $i$ is derived by the following equation.

$$\mathrm{Prob}(i) = \frac{e^{act_i}}{\sum_{1 \le j \le n} e^{act_j}}$$
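
To make the softmax step concrete, here is a minimal Python sketch. It is an illustration only and not SNoW's implementation: the function name and the example activations are assumptions, and subtracting the maximum is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(activations):
    """Map raw activation values to probabilities that sum to 1."""
    # Shifting by the maximum avoids overflow in exp() and leaves
    # the ratios e^{act_i} / sum_j e^{act_j} unchanged.
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

Because the exponential is monotonic, the ranking of the classes by activation is preserved in the resulting probabilities.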


3  First  Phase:  Find  Argument  Candidates 

The first phase predicts the phrases of a given sentence
that correspond to some argument (given the verb).
Unfortunately, it turns out that it is difficult to predict the
exact phrases accurately. Therefore, the goal of the first
phase is to output a superset of the correct phrases by
filtering out unlikely candidates.

Specifically, we learn two classifiers, one to detect
beginning phrase locations and a second to detect end
phrase locations. Each multi-class classifier makes
predictions over forty-three classes: thirty-two argument
types, ten continuous argument types, and one null class
(not beginning for the first classifier, not end for the
second). The following features are used:

•  Word  feature  includes  the  current  word,  two  words 
before  and  two  words  after. 

•  Part-of-speech  tag  (POS)  feature  includes  the  POS 
tags  of  the  current  word,  two  words  before  and  after. 

•  Chunk  feature  includes  the  BIO  tags  for  chunks  of 
the  current  word,  two  words  before  and  after. 

•  Predicate  lemma  &  POS  tag  show  the  lemma  form 
and  POS  tag  of  the  active  predicate. 

•  Voice feature indicates the voice (active/passive) of
the current predicate. This is extracted with a simple
rule: a verb is identified as passive if it follows a
to-be verb in the same phrase chunk and its POS tag
is VBN (past participle) or it immediately follows a
noun phrase.

•  Position feature describes if the current word is
before or after the predicate.

•  Chunk pattern feature encodes the sequence of
chunks from the current word to the predicate.

•  Clause  tag  indicates  the  boundary  of  clauses. 

•  Clause path feature is a path formed from a
semi-parsed tree containing only clauses and chunks.
Each clause is named with the chunk immediately
preceding it. The clause path is the path from
predicate to target word in the semi-parsed tree.

•  Clause position feature is the position of the
target word relative to the predicate in the semi-parsed
tree containing only clauses. Specifically, there
are four configurations: target word and predicate
share the same parent, parent of target word is ancestor
of predicate, parent of predicate is ancestor of target
word, or otherwise.
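
As an illustration of the windowed features in the list above (word, POS tag and chunk tag of the current word plus two positions on each side), a minimal Python sketch follows. The function name and the padding symbol are assumptions made here; the paper does not say how positions outside the sentence are handled.

```python
def window(items, index, size=2, pad="<PAD>"):
    """Return the items from index-size to index+size, padding
    positions that fall outside the sentence (padding is an assumption)."""
    return [
        items[j] if 0 <= j < len(items) else pad
        for j in range(index - size, index + size + 1)
    ]

# Hypothetical sentence with its POS tags.
words = ["The", "cat", "sat", "down"]
tags = ["DT", "NN", "VBD", "RB"]
print(window(words, 0))
print(window(tags, 2))
```

The same helper can be applied to the word, POS and chunk sequences, since all three features use the same five-position window.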

Because  each  phrase  consists  of  a  single  beginning  and 
a  single  ending,  these  classifiers  can  be  used  to  construct 
a  set  of  potential  phrases  (by  combining  each  predicted 
begin  with  each  predicted  end  after  it  of  the  same  type). 
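
The pairing of begin and end predictions can be sketched as follows. This is a minimal illustration under stated assumptions: predictions are represented as (position, type) pairs, positions are inclusive word indices, and an end at the same position as a begin is allowed so that one-word phrases are possible. None of these representation choices come from the paper.

```python
def build_candidates(begins, ends):
    """Combine each predicted begin with every predicted end of the
    same argument type that does not precede it.

    begins, ends: lists of (position, arg_type) pairs for one verb.
    Returns a list of (begin, end, arg_type) candidate phrases.
    """
    candidates = []
    for b_pos, b_type in begins:
        for e_pos, e_type in ends:
            if e_type == b_type and e_pos >= b_pos:
                candidates.append((b_pos, e_pos, b_type))
    return candidates

# Hypothetical predictions for one verb.
begins = [(0, "A0"), (3, "A1")]
ends = [(1, "A0"), (5, "A1"), (2, "A1")]
print(build_candidates(begins, ends))
```

In this example the A1 end at position 2 is discarded because it precedes the only A1 begin, leaving two candidate phrases.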

Although the outputs of this phase are potential
argument candidates, along with their types, the second
phase re-scores the arguments using all possible types.
After eliminating the types from consideration, the first
phase achieves 98.96% and 88.65% recall (overall,
without verb) on the training and the development set,
respectively. Because these are the only candidates that are
passed to the second phase, 88.65% is an upper bound
of the recall for our overall system.

4  Second  Phase:  Phrase  Classification 

The second phase of our system assigns the final
argument classes to (a subset of) the phrases supplied from the
first phase. This task is accomplished in two steps. First,
a multi-class classifier is used to supply confidence scores
corresponding to how likely individual phrases are to
have specific argument types. Then we look for the most
likely solution over the whole sentence, given the matrix
of confidences and linguistic information that serves as a
set of global constraints over the solution space.

Again, the SNoW learning architecture is used to train
a multi-class classifier to label each phrase with one of
the argument types, plus a special class: no argument.
Training examples are created from the phrase candidates
supplied from the first phase using the following features:

•  Predicate lemma & POS tag, voice, position,
clause path, clause position, chunk pattern: same
features as in the first phase.

•  Word & POS tag from the phrase, including the
first/last word and tag, and the head word¹.

•  Named  entity  feature  tells  if  the  target  phrase  is, 
embeds,  overlaps,  or  is  embedded  in  a  named  entity. 

•  Chunk  features  are  the  same  as  named  entity  (but 
with  chunks,  e.g.  noun  phrases). 

•  Length  of  the  target  phrase,  in  the  numbers  of  words 
and  chunks. 

•  Verb  class  feature  is  the  class  of  the  active  predicate 
described  in  the  frame  files. 

•  Phrase type uses simple heuristics to identify the
target phrase as VP, PP, or NP.

•  Sub-categorization describes the phrase structure
around the predicate. We separate the clause
containing the predicate into three parts: the predicate
chunk, and the segments before and after the predicate. The
sequence of the phrase types of these three segments
is our feature.

•  Baseline  follows  the  rule  of  identifying  AM-NEG 
and  AM-MOD  and  uses  them  as  features. 

•  Clause coverage describes how much of the local
clause (from the predicate) is covered by the target
phrase.

•  Chunk  pattern  length  feature  counts  the  number  of 
patterns  in  the  phrase. 

•  Conjunctions  join  every  pair  of  the  above  features 
as  new  features. 

•  Boundary  words  &  POS  tags  include  one  or  two 
words/tags  before  and  after  the  target  phrase. 

¹ We use simple rules to first decide if a candidate phrase
type is VP, NP, or PP. The headword of an NP phrase is the
right-most noun. Similarly, the left-most verb/preposition of a
VP/PP phrase is extracted as the headword.


•  Bigrams  are  pairs  of  words/tags  in  the  window  from 
two  words  before  the  target  to  the  first  word  of  the 
target,  and  also  from  the  last  word  to  two  words  after 
the  phrase. 

•  Sparse collocation picks one word/tag from the two
words before the phrase, the first word/tag, the last
word/tag of the phrase, and one word/tag from the
two words after the phrase to join as features.
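
The headword rule given in the footnote to the feature list (right-most noun for an NP, left-most verb or preposition for a VP or PP) could be sketched as below. This is a sketch, not the authors' code: the function name is hypothetical, and the use of Penn Treebank tag prefixes (`NN` for nouns, `VB` for verbs, `IN` and `TO` for prepositions) is an assumption about how the rule would be applied to POS tags.

```python
def head_word(words, tags, phrase_type):
    """Pick a headword for a candidate phrase from its words and POS tags.

    Assumes Penn Treebank style tags; returns None if no word matches.
    """
    pairs = list(zip(words, tags))
    if phrase_type == "NP":
        # Right-most noun.
        for word, tag in reversed(pairs):
            if tag.startswith("NN"):
                return word
    elif phrase_type == "VP":
        # Left-most verb.
        for word, tag in pairs:
            if tag.startswith("VB"):
                return word
    elif phrase_type == "PP":
        # Left-most preposition.
        for word, tag in pairs:
            if tag in ("IN", "TO"):
                return word
    return None

print(head_word(["the", "stock", "market"], ["DT", "NN", "NN"], "NP"))
print(head_word(["in", "the", "park"], ["IN", "DT", "NN"], "PP"))
```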

Alternatively, we could have derived a scoring function
from the first phase confidences of the open and closed
predictors for each argument type. This method has
proved useful in the literature for shallow parsing
(Punyakanok and Roth, 2001). However, we expect that
additional global features of the phrase are necessary
due to the variety and complexity of the argument types.
See Table 1 for a comparison.

Formally (but very briefly), the phrase classifier is
attempting to assign labels to a set of phrases, $S^{1:M}$,
indexed from 1 to $M$. Each phrase $S^i$ can take any label
from a set of phrase labels, $\mathcal{P}$, and the indexed set of
phrases can take a set of labels, $s^{1:M} \in \mathcal{P}^M$. If we
assume that the classifier returns a score, $score(S^i = s^i)$,
corresponding to the likelihood of seeing label $s^i$ for
phrase $S^i$, then, given a sentence, the unaltered inference
task that is solved by our system maximizes the score of
the phrases, $score(S^{1:M} = s^{1:M})$:

$$\hat{s}^{1:M} = \arg\max_{s^{1:M} \in \mathcal{P}^M} score(S^{1:M} = s^{1:M}) = \arg\max_{s^{1:M} \in \mathcal{P}^M} \sum_{i=1}^{M} score(S^i = s^i). \qquad (1)$$
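
Because the objective in Equation 1 is a sum of per-phrase scores with no coupling between phrases, the unconstrained maximum can be found by picking the best label for each phrase independently. The following Python sketch illustrates this; the dictionary representation of scores, the label names and the numbers are assumptions made for the example.

```python
def unconstrained_labeling(scores):
    """scores[i][label] holds score(S^i = label).
    Returns the highest-scoring label for each phrase independently."""
    return [max(row, key=row.get) for row in scores]

# Hypothetical scores for two candidate phrases.
scores = [
    {"A0": 0.7, "A1": 0.2, "null": 0.1},
    {"A0": 0.6, "A1": 0.3, "null": 0.1},
]
print(unconstrained_labeling(scores))
```

In this example both phrases receive A0, which shows why a purely local argmax can produce a labeling that is not a valid role assignment.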

The second step for phrase identification is eliminating
labelings using global constraints derived from linguistic
information and structural considerations. Specifically,
we limit the solution space through the use of a filter
function, $\mathcal{F}$, that eliminates many phrase labelings from
consideration. It is interesting to contrast this with
previous work that filters individual phrases (see Carreras and
Màrquez, 2003). Here, we are concerned with global
constraints as well as constraints on the phrases.
Therefore, the final labeling becomes

$$\hat{s}^{1:M} = \arg\max_{s^{1:M} \in \mathcal{F}(\mathcal{P}^M)} \sum_{i=1}^{M} score(S^i = s^i). \qquad (2)$$

The  filter  function  used  considers  the  following  con¬ 
straints: 

1.  Arguments cannot cover the predicate except those
that contain only the verb or the verb and the
following word.

2.  Arguments cannot overlap with the clauses (they can
be embedded in one another).

3.  If a predicate is outside a clause, its arguments
cannot be embedded in that clause.

4.  No overlapping or embedding phrases.

5.  No duplicate argument classes for A0-A5,V.

6.  Exactly one V argument per sentence.

7.  If there is C-V, then there has to be a V-A1-C-V
pattern.

8.  If there is a R-XXX argument, then there has to be a
XXX argument.

9.  If there is a C-XXX argument, then there has to be
a XXX argument; in addition, the C-XXX argument
must occur after XXX.

10.  Given the predicate, some argument classes are
illegal (e.g. predicate 'stalk' can take only A0 or A1).

Constraint 1 is valid because all the arguments of a
predicate must lie outside the predicate. The exception is for
the boundary of the predicate itself. Constraints 1 through
3 are actually constraints that can be evaluated
on a per-phrase basis and thus can be applied to the
individual phrases at any time. For efficiency's sake, we
eliminate these even before the second phase scoring is begun.
Constraints 5, 8, and 9 are valid for only a subset of the
arguments.
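
To illustrate how global constraints restrict the argmax, the sketch below enumerates all labelings for a handful of candidates and keeps the best one that satisfies two of the constraints listed above: no overlapping phrases (constraint 4) and no duplicate core classes (constraint 5). It is a brute-force illustration under stated assumptions, not the inference procedure used in the system, which relies on integer linear programming. The span representation, the `null` label for "no argument" and all scores are hypothetical.

```python
from itertools import product

NO_DUPLICATES = {"A0", "A1", "A2", "A3", "A4", "A5", "V"}

def overlaps(a, b):
    """True if two inclusive (begin, end) spans share a word."""
    return a[0] <= b[1] and b[0] <= a[1]

def is_legal(spans, labels, null="null"):
    """Check constraints 4 and 5 on the phrases that receive a real label."""
    used = [(s, l) for s, l in zip(spans, labels) if l != null]
    for i, (span_i, label_i) in enumerate(used):
        for span_j, label_j in used[i + 1:]:
            if overlaps(span_i, span_j):
                return False
            if label_i == label_j and label_i in NO_DUPLICATES:
                return False
    return True

def constrained_labeling(spans, scores, null="null"):
    """Exhaustively search for the best-scoring legal labeling."""
    best, best_score = None, float("-inf")
    for labels in product(*[list(row) for row in scores]):
        if not is_legal(spans, labels, null):
            continue
        total = sum(row[l] for row, l in zip(scores, labels))
        if total > best_score:
            best, best_score = list(labels), total
    return best

spans = [(0, 1), (3, 4)]
scores = [
    {"A0": 0.7, "A1": 0.2, "null": 0.1},
    {"A0": 0.6, "A1": 0.3, "null": 0.1},
]
print(constrained_labeling(spans, scores))
```

Exhaustive search is exponential in the number of candidates, so it is only practical for tiny examples; it is shown here because it makes the role of the filter explicit.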

These constraints are easy to transform into linear
constraints (for example, for each class $c$, constraint 5
becomes $\sum_{i=1}^{M} [S^i = c] \le 1$)². Then the optimum solution
of the cost function given in Equation 2 can be found by
integer linear programming³. A similar method was used
for entity/relation recognition (Roth and Yih, 2004).
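
One way the resulting program might be written out, as a sketch: introduce a binary indicator $z_{i,c}$ that is 1 exactly when phrase $S^i$ receives label $c$. The indicator variables are notation introduced here for illustration and do not appear in the paper; only constraint 5 is shown, and the remaining constraints would be added as further linear inequalities over the same variables.

```latex
\begin{aligned}
\max_{z}\quad & \sum_{i=1}^{M} \sum_{c \in \mathcal{P}} score(S^i = c)\, z_{i,c} \\
\text{s.t.}\quad & \sum_{c \in \mathcal{P}} z_{i,c} = 1, && i = 1, \dots, M \\
& \sum_{i=1}^{M} z_{i,c} \le 1, && c \in \{A0, \dots, A5, V\} \\
& z_{i,c} \in \{0, 1\}
\end{aligned}
```

The first constraint says that every phrase takes exactly one label (possibly the no-argument class), and the second is the indicator form of constraint 5.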

Almost all previous work on shallow parsing and
phrase classification has used Constraint 4 to ensure that
there are no overlapping phrases. By considering
additional constraints, we show improved performance (see
Table 1).

5  Results 

In this section, we present results. For the second phase,
we evaluate the quality of the phrase predictor. The first
result evaluates the phrase classifier given the perfect
phrase locations, without using inference (i.e. the filter
leaves the label space unrestricted,
$\mathcal{F}(\mathcal{P}^M) = \mathcal{P}^M$). The second adds inference to the
phrase classification, again given perfect phrase locations
(see Table 2). We evaluate
the overall performance of our system (without
assuming perfect phrases) by training and evaluating the phrase
classifier on the output from the first phase (see Table 3).

Finally, since this is a tagging task, we compare this
system with the basic tagger that we have, the CLCL
shallow parser from (Punyakanok and Roth, 2001), which
is equivalent to using the scoring function from the first
phase with only the non-overlapping constraints. Table 1
shows how additional constraints over the standard
non-overlapping constraints improve performance on the
development set.

² where [x] is 1 if x is true and 0 otherwise.

³ (Xpress-MP, 2003) was used in all experiments to solve
integer linear programming.

                          Precision   Recall    F1
1st Phase, non-Overlap    70.54%      61.50%    65.71
1st Phase, All Const.     70.97%      60.74%    65.46
2nd Phase, non-Overlap    69.69%      64.75%    67.13
2nd Phase, All Const.     71.96%      64.93%    68.26

Table 1: Summary of experiments on the development set.
The phrase scoring is chosen from either the first phase or the
second phase and each is evaluated by considering simply
non-overlapping constraints or the full set of linguistic constraints.
To make a fair comparison, parameters were set separately to
optimize performance when using the first phase results. All
results are for overall performance.

                          Precision   Recall    F1
Without Inference         86.95%      87.24%    87.10
With Inference            88.03%      88.23%    88.13

Table 2: Results of second phase phrase prediction and
inference assuming perfect boundary detection in the first phase.
Inference improves performance by restricting label sequences
rather than restricting structural properties since the correct
boundaries are given. All results are for overall performance
on the development set.
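
For readers who want to check the F1 column, the standard definition is the harmonic mean of precision and recall. The short Python sketch below recomputes it from the rounded percentages reported for the development set; the function name is an assumption, and because the inputs are already rounded the recomputed value can differ from a reported one in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# First-phase scoring with non-overlapping constraints (Table 1).
print(round(f1(70.54, 61.50), 2))
# Second-phase scoring with all constraints (Table 1).
print(round(f1(71.96, 64.93), 2))
```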

6  Conclusion 

We show that linguistic information is useful for semantic
role labeling, both for deriving features and for deriving
hard constraints on the output. We show that it is possible
to use integer linear programming to perform inference
that incorporates a wide variety of hard constraints that
would be difficult to incorporate using existing methods.
In addition, we provide further evidence supporting the
use of scoring phrases over scoring phrase boundaries for
complex tasks.